A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on GEMM, to explore their effect on performance. In particular, we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications, our code outperforms both MKL and cuBLAS by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and Kepler GPU architectures, for large numbers of DGEMM operations using matrices of size 2 × 2 to 20 × 20.
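The interleaving mentioned above can be sketched as follows: instead of storing each small matrix contiguously, the same element of every matrix in the batch is stored contiguously, so the innermost loop of a batched GEMM can run over the batch index with unit stride. This is a minimal illustration of the idea, not the paper's implementation; the index formulas and function names here are our own assumptions.

```c
#include <stddef.h>

/* Batch of `nb` column-major m x n matrices.
   Block layout:       matrix k is contiguous  -> idx = k*m*n + j*m + i
   Interleaved layout: element (i,j) of all k  -> idx = (j*m + i)*nb + k
   In the interleaved form, a loop over k touches consecutive memory,
   which aids vectorization and prefetching. */
static inline size_t blk(size_t k, size_t i, size_t j,
                         size_t m, size_t n, size_t nb) {
    (void)nb; return k*m*n + j*m + i;
}
static inline size_t ilv(size_t k, size_t i, size_t j,
                         size_t m, size_t n, size_t nb) {
    (void)n; return (j*m + i)*nb + k;
}

/* C = A*B for every matrix in the batch (A: m x p, B: p x n, C: m x n),
   all stored interleaved. The innermost loop runs over the batch index k,
   so each load/store is unit-stride across the batch. */
void batched_gemm_interleaved(size_t m, size_t n, size_t p, size_t nb,
                              const double *A, const double *B, double *C) {
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            for (size_t k = 0; k < nb; ++k)
                C[ilv(k, i, j, m, n, nb)] = 0.0;
            for (size_t l = 0; l < p; ++l)
                for (size_t k = 0; k < nb; ++k)
                    C[ilv(k, i, j, m, n, nb)] +=
                        A[ilv(k, i, l, m, p, nb)] * B[ilv(k, l, j, p, n, nb)];
        }
}
```

For the tiny matrices targeted here (2 × 2 to 20 × 20), a whole matrix occupies only a few cache lines, so interleaving trades per-matrix locality for batch-direction vectorization.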